ABSTRACT



Context. In the next decade the Large Synoptic Survey Telescope will become a major facility for the astronomical community.
However, accurately determining the redshifts of the observed galaxies without using spectroscopy is a major challenge.
Aims. Reconstruction of the redshifts with high resolution and well-understood uncertainties is mandatory for many science goals, 
including the study of baryonic acoustic oscillations (BAO). We investigate different approaches to establish the accuracy that can be 
reached by the LSST six-band photometry. 

Methods. We construct a realistic mock galaxy catalog, based on the GOODS survey luminosity function, by simulating the expected 
apparent magnitude distribution for the LSST. To reconstruct the photometric redshifts (photo-z's) we consider a template-fitting 
method and a neural network method. The photo-z reconstruction from both of these methods is tested on real CFHTLS data and also 
on simulated catalogs. We describe a new technique to efficiently remove catastrophic outliers via a likelihood ratio statistical test that 
uses the posterior probability functions of the fit parameters and the colors. 

Results. We show that the photometric redshift accuracy will meet the stringent LSST requirements up to redshift ~ 2.5 after a 
selection based on the likelihood ratio test or a selection based on the apparent magnitude, for galaxies with S/N > 5 in at least 5 
bands. The former selection has the advantage of retaining roughly 35% more galaxies for a similar photo-z performance compared 
to the latter. Photo-z reconstruction using a neural network algorithm is also described. In addition we utilize the CFHTLS spectro- 
photometric catalog to outline the possibility of combining the neural network and template-fitting methods. 

Conclusions. We demonstrate that the photometric redshifts will be accurately estimated with the LSST if a Bayesian prior probability 
and a calibration sample are used. 

Key words. cosmology - photometric redshift - large scale survey - LSST - CFHTLS



1. Introduction 

The Large Synoptic Survey Telescope (LSST) has an optimal design for investigating the mysterious dark energy. With its large field of view, its short exposure time and its high transmission bandpasses, the LSST will be able to observe a tremendous number of galaxies, out to high redshift, over the sky visible from Cerro Pachon over ten years. This will lead to an unprecedented study of dark energy, among other science programs such as the study of the Milky Way and our Solar System (LSST Science Collaboration 2009).

One of the main systematic uncertainties in the cosmological analysis will be tightly related to errors in the photometric redshift (photo-z) estimation. Estimating the redshift from the photometry alone (Baum 1962) is indeed much less reliable than using spectroscopy, although it does allow measurements to be obtained for vastly more galaxies, in particular those that are very faint and distant. Photo-z estimates are mainly sensitive to characteristic changes in the galaxy's spectral energy distribution (SED) such as the Lyman and the Balmer breaks at 100 nm and 400 nm, respectively. Confusion between these two main features greatly impacts the photometric redshifts, giving rise to catastrophic outliers. Mis-characterizing the proportion of such outliers will strongly impact the level of systematic uncertainties.

There are basically two different techniques to compute the photo-z. On the one hand, template-fitting methods (e.g. Puschell et al. 1982; Bolzonella et al. 2000) fit a model galaxy SED to the photometric data and aim to identify the spectral type, the redshift and possibly other characteristics of the galaxy. It has been shown that using spectroscopic information in the template-fitting procedure, for example by introducing a Bayesian prior probability (Benitez 2000) or by modifying the SED template (Budavari et al. 2000, Ilbert et al. 2006), improves the photo-z quality. This highlights the necessity to have access to at least some spectroscopic data.

On the other hand, empirical methods extract information from a spectroscopic training sample and are therefore generally limited to the spectroscopic redshift range of the sample itself. Among these, the empirical color-redshift relation (Connolly et al. 1995; Sheldon et al. 2012) and neural networks (Vanzella et al. 2004; Collister & Lahav 2004) are commonly used.

In this paper, we address the issue of estimating the photo-z quality with a survey like the LSST, and in particular, we introduce a new method that aims to remove most of the galaxies with catastrophic redshift determination (hereafter called outliers). We utilize a galaxy photometric catalog simulated for a study of the uncertainty expected from the LSST on the dark energy equation of state parameter from baryon acoustic oscillations (BAO). The related results will be presented in a companion paper by Abate et al. in prep.

Based on a Bayesian $\chi^2$ template-fitting method, our photo-z reconstruction algorithm gives access to the posterior probability density functions (pdf) of the fit parameters. Using a training sample, the likelihood ratio test, based on the characteristics of the posterior pdf and the colors, is calibrated and then applied to each galaxy in the photometric sample. The technique is tested on a spectro-photometric catalog from the T0005 data release of CFHTLS (footnote 1) matched with spectroscopic catalogs from VVDS (Le Fevre et al. 2004; Garilli et al. 2008), DEEP2 (Newman et al. 2012) and zCOSMOS (Lilly et al. 2007). We also outline the possibility of discarding outlier galaxies by using photometric redshifts estimated both from the template-fitting method and from a neural network.

Finally we illustrate the modification to the systematic and 
statistical uncertainties on the photo-z when the redshift distri- 
bution of the training sample is biased compared to the actual 
redshift distribution of the photometric catalog to be analyzed. 

The paper is organized as follows. The LSST is presented in section 2, followed by our simulation method in section 3. In the latter, the simulation steps employed and the physical ingredients required to produce the mock galaxy catalogs are described. In section 4 the mock galaxy catalogs for GOODS, CFHTLS and LSST are presented and validated against data for the former two surveys. Our template-fitting method and the likelihood ratio test are described in section 5. The performance of our photo-z template-fitting method is shown in section 6. In section 7 the photo-z neural network technique and its performance in conjunction with our template-fitting method are investigated. Finally, we give a brief discussion of the current limitations of our simulations in section 8 and conclude in section 9.

Throughout the paper, we assume a flat cosmological $\Lambda$CDM model with the following parameter values: $\Omega_m = 0.3$, $\Omega_\Lambda = 0.7$, $\Omega_k = 0$, $H_0 = 70$ km/s/Mpc.



2. The Large Synoptic Survey Telescope 

The LSST is a ground-based optical telescope survey designed, in part, to study the nature of dark energy. It should be one of the fastest and widest telescopes of the coming decades. The same data sample will be used to study the four major probes of dark energy cosmology: type Ia supernovae, weak gravitational lensing, galaxy cluster counts and baryon acoustic oscillations.

It will provide unprecedented photometric accuracy with six 
filters (u, g, r, i, z, y). In the nominal design each filter is expected 
to have a transmission higher than 95% in the bandpass and 
lower than 0.01% outside, with very sharp edges. Moreover, the 
LSST CCDs are expected to have a high quantum efficiency 
compared to current detectors. The LSST will be a modified 



1 Based on observations obtained with MegaPrime/MegaCam, a joint 
project of CFHT and CEA/DAPNIA, at the Canada-France-Hawaii 
Telescope (CFHT), which is operated by the National Research Council 
(NRC) of Canada, the Institut National des Sciences de l'Univers of the
Centre National de la Recherche Scientifique (CNRS) of France, and 
the University of Hawaii. This work is based in part on data products 
produced at TERAPIX and the Canadian Astronomy Data Centre as 
part of the Canada-France-Hawaii Telescope Legacy Survey, a collabo- 
rative project of NRC and CNRS. 



Table 1. Number of visits and $5\sigma$ limiting apparent magnitudes for one year and ten years of LSST operation (Ivezic et al. 2008; Petry et al. 2012).

One year of observation
                              u      g      r      i      z      y
Number of visits              5      8      18     18     16     16
Limiting apparent magnitude   24.9   26.2   26.4   25.7   25.0   23.7

Ten years of observation
                              u      g      r      i      z      y
Number of visits              56     80     184    184    160    160
Limiting apparent magnitude   26.1   27.4   27.5   26.8   26.1   24.9

Table 2. LSST photo-z requirements for the high signal-to-noise "gold sample" subset with $i < 25.3$. The parameters are defined as follows: $\sigma_z/(1+z)$ is the root-mean-square scatter in photo-z; $\eta$ is the fraction of $3\sigma$ outliers at all redshifts; $e_z$ is the bias, defined as the mean of $(z_p - z)/(1+z)$ at a given $z$, where $z_p$ is the photo-z.

quantity             requirement   goal
$\sigma_z/(1+z)$     < 0.05        < 0.02
$\eta$               < 10%
$|e_z|$              < 0.003
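The three quantities of Table 2 can be evaluated on any sample with known redshifts. The sketch below is a minimal illustration, not the analysis code of this paper: the function name `photoz_metrics` is hypothetical, the scatter is taken as the rms about zero, and an outlier is assumed to be a galaxy whose normalized residual exceeds three times that rms. To obtain the bias "at a given z", the function would be applied to the galaxies of one redshift bin.

```python
import math

def photoz_metrics(z_true, z_phot):
    """Return (rms scatter, 3-sigma outlier fraction, bias) of (z_p - z)/(1 + z).

    Assumptions: rms is computed about zero, and outliers are residuals
    larger in absolute value than three times that rms.
    """
    deltas = [(zp - z) / (1.0 + z) for z, zp in zip(z_true, z_phot)]
    n = len(deltas)
    bias = sum(deltas) / n
    rms = math.sqrt(sum(d * d for d in deltas) / n)
    outlier_fraction = sum(abs(d) > 3.0 * rms for d in deltas) / n
    return rms, outlier_fraction, bias
```

Other conventions (for example a scatter computed after outlier rejection) would change the numbers, so the convention should be fixed before comparing to the requirements.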



Table 3. Number of galaxies that are both observed photometrically with the CFHTLS survey and spectroscopically.

Spectroscopic survey     CFHTLS field   No. of galaxies   Refs.
VVDS Deep 1              W1 & D1        2011              (1)
DEEP2 Data Release 3     W3 & D3        5483              (2)
VVDS F22                 W4             4485              (3)
zCOSMOS                  D2             2289              (4)

References. (1) Le Fevre et al. (2004); (2) Newman et al. (2012); (3) Garilli et al. (2008); (4) Lilly et al. (2007).



Paul-Baker telescope with three mirrors among which the primary mirror measures 8.4 m diameter. The camera at the center of the tertiary mirror is composed of a field corrector made of three lenses, a carousel of five filters with an additional sixth filter and a focal plane array of 63.4 cm diameter. The latter is composed of 189 science CCDs of 4096 x 4096 10 $\mu$m pixels, cooled down by a cryostat. The field of view of the LSST will be 9.6 deg$^2$ and the survey should cover 30 000 deg$^2$ of sky visible from Cerro Pachon.

The LSST will perform two back-to-back exposures of 2 x 15 sec with a readout time of 2 x 2 sec. The number of visits and the $5\sigma$ limiting apparent magnitude in each band, for one year and ten years of the running survey, are listed in Table 1. With such deep observations, photometric redshifts will necessarily be computed in an essentially unexplored redshift range.

The photo-z requirements as published in the LSST Science Book (LSST Science Collaboration 2009) are given in Table 2.

The final specifications of the LSST are subject to change; see Ivezic et al. (2008) for the latest numbers.
Fig. 1. SED templates linearly interpolated from the original six templates from Coleman et al. (1980) and Kinney et al. (1996). The original templates are drawn in red.



3. Simulation of galaxy catalogs 

This section begins by describing the main steps carried out in 
order to produce a catalog of galaxy photometric data. The rest 
of the section describes the ingredients used to generate these 
catalogs in more detail. 



3.1. Simulation steps 

3.1.1. Simulating galaxy distributions

In order to simulate the galaxy catalog, we first compute the total number of galaxies $N_g$ within our survey volume between absolute magnitudes $M_1$ and $M_2$. Then for each of these $N_g$ galaxies, we assign redshifts and galaxy types.

If $\phi$ is the sum of luminosity functions over the Early, Late and Starburst galaxy types (see section 3.2.1 for more details), then the number of galaxies $N_g$ is given by

$$N_g = \int_0^{6} \int_{M_1}^{M_2} \phi(M,z)\, \frac{c}{H_0}\, (1+z)^2\, d_A(z)^2\, E(z)^{-1}\, \Omega\, dz\, dM ,$$



where $M$ is the absolute magnitude in some band, $d_A(z)$ is the angular diameter distance, the function $E(z)$ is defined by

$$E(z) = \sqrt{\Omega_m (1+z)^3 + \Omega_\Lambda} ,$$

and $\Omega$ (no subscript) is the solid angle of the simulated survey. The redshift range is chosen so as not to miss objects that may be observable by the survey. The exact choice of $M_1$ and $M_2$ is not critical because (i) at the bright limit the luminosity function goes quickly to zero; therefore as long as $M_1$ is less than -24 the integral is not affected, and (ii) as long as $M_2$ is chosen to be fainter than the maximum absolute magnitude observable by the survey, then all galaxies that are possible to observe will be included; we chose $M_2 = -13$. The redshift $z_s$ of each simulated galaxy is drawn from the cumulative density function:



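$$C_z(z_s) = \frac{\int_0^{z_s} \int_{M_1}^{M_2} \phi(M,z')\, \frac{c}{H_0}\, (1+z')^2\, d_A(z')^2\, E(z')^{-1}\, dz'\, dM}{\int_0^{6} \int_{M_1}^{M_2} \phi(M,z)\, \frac{c}{H_0}\, (1+z)^2\, d_A(z)^2\, E(z)^{-1}\, dz\, dM} .$$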




Fig. 2. Extinction by dust as a function of $\lambda$ for three different color excess values: $E(B-V) = 0.1$ in black, $E(B-V) = 0.2$ in red and $E(B-V) = 0.3$ in green. The solid lines correspond to the Cardelli law (Cardelli et al. 1989). The bump around 217 nm can be explained by graphite or the presence of Polycyclic Aromatic Hydrocarbons (cf. for example Clayton et al. 2003). The dashed lines correspond to the Calzetti law (Calzetti et al. 2000).



Once the redshift of the galaxy, denoted by $z_s$, is assigned, the absolute magnitude $M$ is drawn from the following cumulative density function:

$$C_M(M, z_s) = \frac{\int_{M_1}^{M} \phi(M', z_s)\, dM'}{\int_{M_1}^{M_2} \phi(M', z_s)\, dM'} .$$



Finally, a broad galaxy type is assigned from the observed distribution of each type at redshift $z_s$ and absolute magnitude $M$. This distribution is constructed from the type-dependent luminosity functions. Therefore each galaxy is designated a broad type value of either Early, Late or Starburst. An SED from the SED library is then selected for each galaxy, according to the simulation procedure described in section 3.2.2.
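Drawing a value from a cumulative density function such as $C_M(M, z_s)$ amounts to inverse-transform sampling. The sketch below illustrates the idea under simplifying assumptions: the Schechter parameters are placeholders (not the measured GOODS values), the luminosity function does not evolve with redshift, and the cumulative function is tabulated on a regular grid with the trapezoid rule. The helper names `tabulate_cdf` and `draw` are hypothetical. The same two helpers could be reused for the redshift draw by passing the redshift integrand instead.

```python
import bisect
import math
import random

def schechter(M, phi_star=1.0e-3, M_star=-21.0, alpha=-1.2):
    """Schechter function in magnitudes; default parameters are placeholders."""
    y = 10.0 ** (-0.4 * (M - M_star))
    return 0.4 * math.log(10.0) * phi_star * y ** (alpha + 1.0) * math.exp(-y)

def tabulate_cdf(pdf, lo, hi, n=2000):
    """Trapezoid-rule CDF of an unnormalized pdf on [lo, hi], normalized to 1."""
    xs = [lo + (hi - lo) * i / n for i in range(n + 1)]
    cdf = [0.0]
    for a, b in zip(xs[:-1], xs[1:]):
        cdf.append(cdf[-1] + 0.5 * (pdf(a) + pdf(b)) * (b - a))
    total = cdf[-1]
    return xs, [c / total for c in cdf]

def draw(xs, cdf, rng):
    """Inverse-transform draw, interpolating linearly between grid points."""
    u = rng.random()
    i = bisect.bisect_left(cdf, u)
    i = min(max(i, 1), len(cdf) - 1)
    c0, c1 = cdf[i - 1], cdf[i]
    t = 0.0 if c1 == c0 else (u - c0) / (c1 - c0)
    return xs[i - 1] + t * (xs[i] - xs[i - 1])

# Usage: absolute magnitudes between M_1 = -24 and M_2 = -13.
rng = random.Random(0)
grid, cdf_M = tabulate_cdf(schechter, -24.0, -13.0)
magnitudes = [draw(grid, cdf_M, rng) for _ in range(1000)]
```

Tabulating the cumulative function once and reusing it for every galaxy keeps the cost per draw at a single binary search.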

3.1.2. Simulating the photometric data

The simulated apparent magnitude $m_{X,s}$ (footnote 2) in any LSST band $X$, with transmission $X(\lambda)$, for a galaxy of SED type $T_s$ (footnote 3), redshift $z_s$, color excess $E(B-V)_s$ and absolute magnitude $M_s$, is generated as follows:

$$m_{X,s} = M_s + \mu(z_s) + K_{XY}(z_s, T_s, E(B-V)_s) ,$$

where $\mu(z_s)$ is the distance modulus and $K_{XY}(z_s, T_s, E(B-V)_s)$ is the K-correction defined as described in Hogg (1999) for spectral type $T_s$, with flux observed in observation-frame band $X$ and $M_s$ in rest-frame band $Y$. It is expressed as



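$$K_{XY}(z_s, T_s, E(B-V)_s) = -2.5 \log_{10} \left( \frac{1}{1+z_s}\, \frac{\int d\lambda\, \lambda\, T_s\!\left(\frac{\lambda}{1+z_s}, E(B-V)_s\right) X(\lambda)\; \int \frac{d\lambda}{\lambda}\, Y(\lambda)}{\int \frac{d\lambda}{\lambda}\, X(\lambda)\; \int d\lambda\, \lambda\, T_s(\lambda, E(B-V)_s)\, Y(\lambda)} \right) .$$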



Then, the corresponding simulated flux $F_{X,s}$ is computed as

$$F_{X,s} = F_{X,0}\, 10^{-0.4\, m_{X,s}} ,$$

where

$$F_{X,0} = \int c\, X(\lambda)/\lambda\, d\lambda .$$



2 Subscript $s$ stands for the simulated value.

3 Here type $T_s$ refers to the actual SED of the galaxy, and not the broad type value, e.g. Early, Late or Starburst.



Fig. 3. LSST transmission curves shown by the solid lines and 
CFHTLS transmissions shown by the dashed lines. The trans- 
mission takes into account the transmission of the filter itself, 
the expected CCD quantum efficiency, the telescope optical 
throughput and the mean atmospheric transmission. 



In the expression for $F_{X,0}$, $c$ is a constant that depends on the luminosity of the galaxy, the units the flux is measured in and the zeropoint of the AB magnitude system. The uncertainty on the flux is then given by

$$\sigma(F_{X,s}) = 0.4\, \sigma_X(m_{X,s})\, F_{X,s} \ln(10) .$$

The simulated observed flux $F_{X,\mathrm{obs}}$ is drawn from a Gaussian with mean $F_{X,s}$ and standard deviation $\sigma(F_{X,s})$. This is correct as long as the flux is large enough to be well described by a Gaussian distribution. The uncertainty $\sigma_X(m_{X,s})$ on the true magnitude in band $X$ is given by Eq. (2) in section 3.2.5.
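The conversion from a true magnitude to a noisy observed flux can be sketched in a few lines. This is an illustration only: the function name is hypothetical and `flux_zero`, which stands in for $F_{X,0}$, is set to an arbitrary placeholder value.

```python
import math
import random

def simulate_observed_flux(m_true, sigma_m, flux_zero=1.0, rng=random):
    """Draw an observed flux from a Gaussian centered on the true flux.

    flux_zero stands in for F_{X,0}; its default value is a placeholder.
    """
    flux_true = flux_zero * 10.0 ** (-0.4 * m_true)
    # Propagate the magnitude uncertainty to the flux: 0.4 ln(10) sigma_m F.
    sigma_flux = 0.4 * math.log(10.0) * sigma_m * flux_true
    flux_obs = rng.gauss(flux_true, sigma_flux)
    return flux_true, sigma_flux, flux_obs

# Usage: a galaxy of true magnitude 25 with a 0.1 mag uncertainty.
f_true, f_sigma, f_obs = simulate_observed_flux(25.0, 0.1, rng=random.Random(1))
```

Working in flux rather than magnitude lets faint objects scatter to low or even negative measured fluxes, which a Gaussian in magnitude could not represent.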

Note that the apparent magnitude uncertainty $\sigma_X(m_{X,s})$ depends on the number of visits $N_{X,\mathrm{vis}}$. We have performed the simulation for two sets of values of $N_{X,\mathrm{vis}}$ that correspond to one and ten years of observations with the LSST, according to the $N_{X,\mathrm{vis}}$ given in Table 1.

Throughout the paper the quantity $z_s$ refers to the simulated or "true" value of the redshift. Here we also assume that a spectroscopic redshift obtained for one of the simulated galaxies has a value equal to $z_s$. Therefore the value $z_s$ can also be considered to be the galaxy's spectroscopic redshift with negligible error.



3.2. Ingredients 

3.2.1. Luminosity function 

The luminosity function describes probabilistically the expected 
number of galaxies per unit volume and per absolute magnitude. 
If the luminosity functions are redshift- and type-dependent, 
then they give the relative amount of galaxies for each galaxy 
type at a given redshift. 

We use luminosity functions measured from the GOODS survey (Dahlen et al. 2005). The luminosity functions here are modeled by a parametric Schechter function that takes the form:

$$\phi(M) = 0.4 \ln(10)\, \phi^*\, y^{\alpha+1} \exp(-y) , \qquad y = 10^{-0.4 (M - M^*)} .$$




Fig. 4. Histograms of the apparent magnitude in the R band com- 
paring the GOODS simulated data (black points with errorbars) 
to the actual GOODS data (red stars). 

3.2.2. Spectral Energy Distribution (SED) library

We built an SED library composed of 51 SEDs. They were created by interpolating between six template SEDs, described here:

- the Early-type El, the Late-types Sbc, Scd and the Starburst-type Im from Coleman et al. (1980),

- the Starburst-types SB3 and SB2 from Kinney et al. (1996).

These six original SEDs were linearly extrapolated into the UV by using the GISSEL synthetic models from Bruzual & Charlot (2003). The interpolated spectra of the 51 types are displayed in Fig. 1. In the following, we denote by $T_s$ the true spectral type (SED) of the galaxy.
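A library of 51 SEDs is what one obtains from six templates with ten equal steps between each consecutive pair (5 x 10 + 1 = 51). The sketch below illustrates such an interpolation under stated assumptions: the step count of ten is inferred from that arithmetic, the templates are assumed to be ordered El, Sbc, Scd, Im, SB3, SB2 and sampled on a shared wavelength grid, and the function name is hypothetical.

```python
def interpolate_templates(templates, steps_between=10):
    """Linearly interpolate between consecutive templates.

    templates: list of equal-length flux lists on a shared wavelength grid,
    ordered from the earliest to the latest spectral type (an assumption).
    Returns (len(templates) - 1) * steps_between + 1 SEDs.
    """
    library = []
    for left, right in zip(templates[:-1], templates[1:]):
        for k in range(steps_between):
            w = k / steps_between  # weight of the right-hand template
            library.append([(1.0 - w) * a + w * b for a, b in zip(left, right)])
    library.append(list(templates[-1]))
    return library

# Usage with six toy two-point "spectra".
toy_templates = [[float(i), float(2 * i)] for i in range(6)]
sed_library = interpolate_templates(toy_templates)
```

Each original template appears unchanged in the library, at indices 0, 10, 20, 30, 40 and 50.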

Each galaxy is assigned an SED from this library using a flat probability distribution based on its broad type value, originally assigned as either Early, Late or Starburst (see section 3.1.1). This way of generating the spectra may not be as optimal as using more realistic synthetic spectra, but it has heuristic advantages. For example, there is an easy way to relate the luminosity function main galaxy type to an SED type, and additionally it is much faster, in terms of computing time, to produce a large number of galaxy spectra at different evolutionary stages. We are aware, however, that this linear interpolation may bias photometric redshifts estimated using a template-fitting method, because real galaxies are probably not evenly distributed across spectrum space. Therefore this feature may allow the neural network method to be more effective in estimating the redshift.

In the Schechter function of section 3.2.1, $M$ is the absolute magnitude in the B-band of GOODS WFI, and $M^*$, $\phi^*$, and $\alpha$ are the parameters defining the function. Their values can be obtained from Dahlen et al. (2005).



3.2.3. Attenuation by dust and Intergalactic Medium (IGM) 

The reddening caused by dust within the target galaxy is quantified in our simulation by the color excess term $E(B-V)$. Together with this term, the Cardelli law (Cardelli et al. 1989) is used for the galaxies closest to the El, Sbc, and Scd spectral types, whereas the Calzetti law is used for the galaxies closest to the Im, SB3, and SB2 spectral types. In Fig. 2 the reddening laws are plotted for three values of the color excess. The parameter $E(B-V)$ is drawn from a uniform distribution between 0 and 0.3 for all galaxies, except for galaxies closest to the El type. Elliptical galaxies are composed of old stars and contain little or no dust, therefore $E(B-V)$ is drawn only between 0 and 0.1 for these galaxies.

Another process to be considered is the absorption due to the intergalactic medium (IGM). It is caused by clumps of neutral hydrogen along the line of sight, and is well-modeled by the Madau law (Madau 1995). As the absorption occurs at a fixed wavelength in the hydrogen reference frame, it is redshift-dependent in the observer frame. Strong features in the optical part of the SEDs are induced by the IGM at redshifts above about $z \sim 2.8$ in the LSST filter set, when the Lyman-$\alpha$ forest has shifted into the LSST bandpasses. Here we assume this absorption to be independent of the line of sight to the galaxy. An investigation of the effect of the stochasticity of the IGM will be the subject of future work.

Fig. 5. Histograms of the $B - R$ term for different apparent magnitude ranges. Left-hand panel ($20 < R < 22$) corresponds to the bright galaxies and right-hand panel ($24 < R < 25.5$) corresponds to the very faint galaxies. The solid black lines correspond to the simulation and the solid red lines correspond to the GOODS data. The dotted colored lines correspond to the main spectral types in the simulation.

3.2.4. Filters 

The six LSST bandpasses displayed in Fig. 3 take into account the quantum efficiency of the CCD, the mean atmospheric transmission, the filter transmission, and the telescope optical throughput. The CFHTLS filter set (footnote 4) is also displayed in the same figure. We expect to be able to obtain good photo-z estimates up to redshifts of about 1.2 for CFHTLS and 1.4 for LSST, when the 4000 Å break is redshifted out of the filter sets. At redshifts above 2.5 the precision should improve dramatically when the Lyman break begins to redshift into the u-band.

3.2.5. Apparent magnitude uncertainties for the LSST 

The apparent magnitude uncertainties for the LSST are computed following the semi-analytical expression from the LSST Science Book (LSST Science Collaboration 2009). This expression has been evaluated from the LSST exposure time calculator, which takes into account the instrumental throughput, the atmospheric transmission and the airmass, among other physical parameters.

The random error on the (true) magnitude $m_X$ (subscript $s$ has temporarily been dropped) can be written as

$$\sigma^2_{\mathrm{rand},X} = (0.04 - \gamma_X)\, x + \gamma_X\, x^2 , \tag{1}$$

where the value of $\gamma_X$ is reported in the LSST Science Book for each filter $X$. The variable $x$ is defined as follows:

$$x = 10^{0.4 (m_X - m_{5,X})} ,$$

4 The CFHTLS transmissions have been downloaded from http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/community/CFHTLS-SG/docs/extra/filters.html.




Fig. 6. Redshift distribution of the spectroscopic sample for the 
different CFHTLS fields. The histograms are stacked. 

where $m_{5,X}$ is the $5\sigma$ limiting apparent magnitude in the $X$ band, such that $\sigma(m_{5,X}) = 0.2$, and is given by the following expression:

$$m_{5,X} = C_X + 0.5\,(m_{\mathrm{sky}} - 21) + 2.5 \log_{10}(0.7/\theta) + 1.25 \log_{10}(t_{\mathrm{vis}}/30) - k_X (A - 1) ,$$

where the parameter $C_X$ accounts for the total instrumental throughput; $m_{\mathrm{sky}}$ is the sky background apparent magnitude; $\theta$ is the seeing; $t_{\mathrm{vis}}$ is the exposure time in seconds; $k_X$ is the atmospheric extinction; $A$ is the air mass. All of these parameters are reported in the LSST Science Book for median observing conditions.

The total uncertainty on the apparent magnitude includes a systematic uncertainty that comes from the calibration, such that the photometric error in band $X$ is

$$\sigma_X(m_X) = \sqrt{\sigma^2_{\mathrm{rand},X} + \sigma^2_{\mathrm{sys},X}} , \tag{2}$$

where $\sigma_{\mathrm{sys},X}$ is taken to be equal to 0.005 and is the photometric systematic uncertainty of the LSST for a point source. We have adopted this simple formula defined for point sources, and have used it for extended sources. A more realistic computation of this uncertainty for extended sources will be completed in future work.
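Eqs. (1) and (2) translate directly into a short function. The sketch below is illustrative: the function name is hypothetical and the default `gamma=0.039` is a placeholder, since the per-band values are tabulated in the LSST Science Book and are not reproduced here.

```python
import math

def magnitude_uncertainty(m, m5, gamma=0.039, sigma_sys=0.005):
    """Total magnitude uncertainty following Eqs. (1) and (2).

    m: true apparent magnitude; m5: 5-sigma limiting magnitude in the band.
    gamma is a placeholder; sigma_sys = 0.005 is the value quoted in the text.
    """
    x = 10.0 ** (0.4 * (m - m5))
    sigma_rand_sq = (0.04 - gamma) * x + gamma * x * x
    return math.sqrt(sigma_rand_sq + sigma_sys ** 2)

# Usage: an i-band source at the ten-year limiting magnitude of Table 1.
sigma_at_limit = magnitude_uncertainty(26.8, 26.8)
```

At $m_X = m_{5,X}$ the variable $x$ equals 1 and the random term reduces to 0.2 mag whatever the value of $\gamma_X$, while for bright sources the total uncertainty tends to the systematic floor.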

Fig. 7. Evolution of the colors $u - g$ and $i - z$ as a function of redshift computed from the K-correction (colored lines) and measured from the CFHTLS data (black dots). The departure of the computed colors from the data is not significant since not all the galaxy types exist in significant numbers within the CFHTLS data at all redshifts.

Fig. 8. Left-hand side: Cumulative distribution of the apparent magnitude in the $i$ band for the LSST simulation compared to the SDSS measurement from the "Stripe 82" region (cf. Abazajian et al. 2009). Right-hand side: Redshift distribution for the LSST for $i < 24$, $i < 25$, $i < 27$ with $\sigma_X < 0.2$, and for galaxies with S/N > 5 in all bands, both for one and ten years of observations.



3.2.6. Apparent magnitude uncertainties for CFHTLS and the GOODS survey



An analytical expression similar to the one given by Eq. (2) does not exist for the CFHTLS and the GOODS data. The apparent magnitude uncertainties are estimated with algorithms and analysis techniques specific to these surveys and as such the relations in the previous section will not apply.

In the following, where simulations of photometric galaxy catalogs of both of these surveys will be carried out, the simulated uncertainties on the apparent magnitudes will be estimated directly from the survey data themselves. In this way, one can obtain the probability distribution of having $\sigma_X$ given $m_X$. This will allow the assignment of $\sigma_X$, by randomly drawing the value according to this probability density function, given the value of $m_X$.
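One simple way to realize such a conditional draw is to bin the observed catalog in magnitude and resample the uncertainties found in the matching bin. The sketch below is an assumption about how this could be done, not a description of the procedure actually used: the function names, the bin width of 0.5 mag and the nearest-bin fallback are all illustrative choices.

```python
import math
import random
from collections import defaultdict

def build_sigma_sampler(mags, sigmas, bin_width=0.5):
    """Group observed uncertainties by magnitude bin (bin_width is arbitrary)."""
    bins = defaultdict(list)
    for m, s in zip(mags, sigmas):
        bins[math.floor(m / bin_width)].append(s)
    populated = sorted(bins)

    def draw_sigma(m, rng=random):
        """Draw an uncertainty from the empirical distribution at magnitude m."""
        idx = math.floor(m / bin_width)
        if idx not in bins:
            # Fall back to the nearest populated bin.
            idx = min(populated, key=lambda k: abs(k - idx))
        return rng.choice(bins[idx])

    return draw_sigma

# Usage with a toy observed catalog.
draw_sigma = build_sigma_sampler([20.1, 20.3, 24.2, 24.4], [0.01, 0.02, 0.1, 0.2])
sigma_bright = draw_sigma(20.2, random.Random(0))
```

Resampling observed values preserves the shape of the empirical distribution in each bin without requiring a parametric model.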



4. Method validation 



4.1. GOODS 



In order to validate the simulation scheme, we have performed a simulation of the GOODS WFI data (footnote 5) and compared our results to the real photometric catalog used for the computation of the B-band luminosity functions reported in Dahlen et al. (2005).

The simulated photometric catalog corresponds to an effective solid angle of 1100 arcmin$^2$, which is equal to the area covered by the actual data catalog. The simulated redshift and absolute magnitude ranges are respectively [0, 6] and [-24, -13]. The apparent magnitudes are computed for the WFI B-band and R-band. The apparent magnitude uncertainty is now given by the distribution of $\sigma_X$ given $m_X$ computed from the real data (see section 3.2.6). If $m_{X,s}$ is the simulated apparent magnitude in any X-band, the uncertainty is randomly drawn from the distribution $\mathrm{Prob}(\sigma_X | m_{X,s})$ found from the data. The observed apparent



5 Data from the Wide Field Imager (WFI) on the 2.2-m MPG/ESO 
telescope at La Silla. 
Table 4. Values of the fit parameters of Eq. ||and @.



magnitude and its uncertainty are then simulated as detailed in section 3.1.2.

Figure 4 shows the very good agreement of the galaxy number counts in the R-band between the simulation and the real data, except for the very faint galaxies at $R > 25$ for which the selection effect in the real data becomes important. The agreement is also very good for the B-band (not shown here). As displayed in Fig. 5, the distributions of colors from our GOODS simulation (black lines) are in reasonable agreement with the ones from the real data (red lines), for bright and faint galaxies. At bright magnitudes, the fitted luminosity function seems to predict a larger fraction of elliptical galaxies than the data. This feature probably comes from the SED templates and their linear interpolation. We have chosen to interpolate linearly between the SED templates with an equal number of steps between each template. Real galaxies may well not follow this pattern: for example, the SED of a galaxy lying between the El and the Sbc types may not be uniformly distributed between the two, but may be more likely to resemble the Sbc. This could explain our excess. However, the overall shape of the distributions indicates that our simulated photometric catalog represents reality. This is expected because the luminosity functions were computed from the real data sample used for the comparison to the simulation.

4.2. CFHTLS 

The different photo-z reconstruction methods were tested on real data, namely on galaxies observed both photometrically (from CFHTLS T0005) and spectroscopically (from either VVDS, DEEP2 or zCOSMOS surveys). We have followed the procedure described in detail by Coupon et al. (2009).

The CFHTLS T0005 public release contains a photometric catalog of objects observed in 5 bands (u, g, r, i, z) from the D1, D2, D3, W1, W3 and W4 fields. Among these, some galaxies are also spectroscopically observed by the VVDS, DEEP2 Redshift Survey and zCOSMOS surveys. The numbers of galaxies in the CFHTLS fields that were matched to spectroscopic observations are listed in Table 3. To perform the matching, the smallest angle between each galaxy in the CFHTLS catalog and a galaxy from the spectroscopic catalogs was computed. Only galaxies for which the angle is less than 0.7 arcsec (the order of the PSF) are grouped into the spectro-photometric catalog.
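The matching step can be illustrated with a brute-force nearest-neighbor search on the sky. This is a minimal sketch, assuming that each catalog is a list of (right ascension, declination) pairs in degrees; the function names are hypothetical, and a spatial index would be needed for catalogs of realistic size.

```python
import math

ARCSEC_PER_RAD = 180.0 / math.pi * 3600.0

def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine formula); inputs in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return 2.0 * math.asin(min(1.0, math.sqrt(h))) * ARCSEC_PER_RAD

def match_catalogs(photometric, spectroscopic, max_sep_arcsec=0.7):
    """Pair each photometric object with its nearest spectroscopic neighbor.

    Returns (photometric_index, spectroscopic_index) pairs whose separation
    is below max_sep_arcsec. Brute force, O(N * M).
    """
    matches = []
    if not spectroscopic:
        return matches
    for i, (ra, dec) in enumerate(photometric):
        seps = [angular_separation_arcsec(ra, dec, r, d) for r, d in spectroscopic]
        j = min(range(len(seps)), key=seps.__getitem__)
        if seps[j] < max_sep_arcsec:
            matches.append((i, j))
    return matches

# Usage: one object with a counterpart 0.5 arcsec away, one without.
phot = [(150.0, 2.0), (150.1, 2.1)]
spec = [(150.0, 2.0 + 0.5 / 3600.0), (151.0, 3.0)]
pairs = match_catalogs(phot, spec)
```

The haversine form is used because it stays accurate for the sub-arcsecond separations that matter here.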

The spectroscopic redshift distribution of the spectro-photometric catalog is shown in Fig. 6.

A simulation of the CFHTLS data was also performed. This was to enable us to evaluate whether the statistical test described in section 5.2 can be calibrated with a simulation and then applied to real data, for which the photometric redshifts of galaxies are not known. Such a procedure would be useful when no spectroscopic sample is available to calibrate the prior probabilities or the likelihood ratio statistical test presented in section 5.2. The same argument stands for the neural network analysis described in section 7, which could also benefit from a simulated training sample.

Because the selection function of the spectro-photometric catalog is not based only on the detection threshold, the redshift and color distributions of the simulation and real data cannot be rigorously compared. The photometric selection criteria are not consistent over this data sample due to the differing selection of VVDS, DEEP2 and zCOSMOS, and therefore will not match the simulation. Since we also do not know how these selection criteria vary, we cannot consistently compare the simulation to the data. However, the theoretical colors can be compared to the


LSST
Spectral family   Early          Late            Starburst
$\alpha$          2.97 ± 0.01    -1.57 ± 0.48    1.32 ± 0.00
$z_0$             0.00 ± 0.00    0.12 ± 0.02     0.00 ± 0.00
$k_m$             0.13 ± 0.00    0.08 ± 0.00     0.08 ± 0.00
$f_t$             0.22           0.41
$k_t$             0.27 ± 0.00    0.10 ± 0.00

CFHTLS
Spectral family   Early          Late            Starburst
$\alpha$          3.51 ± 0.02    2.77 ± 0.01     1.89 ± 0.00
$z_0$             0.00 ± 0.00    0.18 ± 0.02     0.02 ± 0.01
$k_m$             0.13 ± 0.00    0.02 ± 0.01     0.08 ± 0.00
$P_m$             0.18 ± 0.02    0.02 ± 0.01
P ...             1.71 ± 0.08    3.12 ± 0.23
$f_t$             0.23           0.43
$k_t$             0.27 ± 0.00    0.11 ± 0.00



ones given by real data. To do so, the K-corrections are computed for each main galaxy type $T_s$, and this allows us to infer the colors as a function of the redshift. These are displayed in Fig. 7 for the $u - g$ and $i - z$ terms. The $g - r$ and $r - i$ terms exhibit a similar behavior. The data points and the theoretical curves are in agreement over the whole redshift range.

4.3. LSST forecasts 

The simulated LSST sample considered here is the same as the one used in the companion paper Abate et al. in prep. It is generated with a solid angle of 7850 deg$^2$ and a redshift range of [0.1, 12]. This upper redshift limit is large enough to include all observable galaxies. The total number of galaxies is $\sim 8 \times 10^9$. For the purpose of this paper, namely the reconstruction of photometric redshifts, such a large number of galaxies is not necessary. Therefore the following analysis is done with a smaller subsample of galaxies, see section 6.2.

The expected cumulative number counts per unit of solid angle
and per i-band apparent magnitude for the LSST are shown on
the left-hand side of Fig. 8 and are compared to the SDSS measurements
made on the "Stripe 82" region (cf. Abazajian et al.
2009). The LSST galaxy count is generally below that of SDSS,
with a more pronounced effect at the high and low magnitude ends
(m ~ 26 and m ~ 20). There was no specific matching of the
GOODS luminosity functions to the SDSS, so we expect some
discrepancy, since the simulation is compared to a distribution
that is different from the one it was drawn from.

The expected number of galaxies per unit of solid angle and
per redshift is shown on the right-hand side of Fig. 8 for different
cuts: i magnitude cuts of i < 24, i < 25, i < 27 with σ_X < 0.2
for all X bands, for both one (dotted lines) and ten years (solid
lines) of observations. The LSST gold sample is defined as
all galaxies with i < 25.3 and is expected to produce high quality
photometric redshifts.

The number of galaxies with S/N > 5 in all six bands at ten
years is fairly small because these constraints are strong for the
low acceptance u and y bands. Low S/N in the u-band is expected
both from its shallower depth and from dropout galaxies
at higher redshift, where hydrogen absorption removes all flux
blue-ward of the Lyman break, so that non-detections at z > 3
are expected. The low S/N in the y-band is expected just from its
shallower depth.

Fig. 9. Example of photometric computation for a simulated galaxy observed with the LSST in six bands at 5σ for ten years
of observation. The 2D distributions correspond to the posterior probability density functions marginalized over the remaining
parameter, and the 1D distributions correspond to the posterior probability density functions marginalized over the two remaining
parameters. The top middle box gives the values of the input parameters. In the top right-hand panel, the index grid
denotes the parameters that maximize the 3D posterior probability density function on the grid, and in the middle right-hand panel
the index marg denotes the parameters that maximize the 1D posterior probability density functions. The size of the grid cells has
been reduced, and the z_p axis has been shortened compared to the size of the grid usually used to compute the likelihood function.

Fig. 10. Probability density functions of the reduced variables
N_pk(z_p), N_pk(T_p) and g - r. The black lines correspond to P(μ_i|G)
and the red lines to P(μ_i|O).



5. Enhanced template-fitting method 

5.1. Maximum of the posterior probability density function 

In this section, our template-fitting method for estimating photometric
redshifts is presented. The algorithm follows the approach
developed by Ilbert et al. (2006), Bolzonella et al. (2000)
or Benitez (2000).

The template-fitting method consists in finding the
photo-z z_p, the SED template T_p, the color excess E(B - V)_p
and the SED normalization N that give the fluxes in each band
that best fit the observed values. Following Benitez (2000),
the normalization parameter is marginalized over, so that the parameters
of interest are given by the minimum of a χ^2 statistic
whose expression is given later in this section.

A Bayesian prior probability can be used to improve the photometric
redshift reconstruction. It is defined as the probability of
having a galaxy of redshift z and type T, given its apparent magnitude i.
It was introduced by Benitez (2000). Bayes' theorem
indicates that this probability can be expressed as the product of
the probability of having a galaxy of type T given the apparent
magnitude i, P(T|i), times the probability of having a galaxy of
redshift z given the type and the apparent magnitude, P(z|T,i).
In other words,

\[ P(z, T | i) = P(z | T, i) \times P(T | i). \]

The two terms are well described by the functions

\[ P(T|i) = f_t \, e^{-k_t (i - 20)}, \tag{3} \]

and

\[ P(z|T,i) \propto z^{\alpha} \exp\left[ -\left( \frac{z}{z_m} \right)^{\alpha} \right], \qquad z_m = z_0 + k_m (i - 16) + p_m (i - 16)^{\beta}. \tag{4} \]

Subscript p refers to the "best-fit" parameters.

Fig. 11. Likelihood ratio distribution from the LSST simulation
training sample. Logarithm of the probability density P(L_R|G) in
solid black and P(L_R|O) in dashed red.



Here, T represents the spectral family (broad type) instead of
the spectral type (the exact SED), i.e. galaxies with spectral
type lower than 5 belong to the Early type, those with spectral type
between 6 and 25 belong to the Late type and the rest belong to
the Starburst type. This parametrization follows a general model
for galaxy number counts with redshift, and is improved to account
for higher redshifts in CFHTLS. The parameters of P(T|i)
and P(z|T,i) are found by fitting Eq. 3 and Eq. 4 to the simulated
magnitude-redshift distributions for both the LSST and
CFHTLS surveys. The values of the parameters in Eq. 3 and Eq. 4
are given in table 4. The fitted p_m parameter for the LSST is
compatible with zero, while it is meaningful for CFHTLS (see
table 4); it was therefore set to zero when representing the prior
probability for the LSST simulations. There is no value of the
parameters f_t and k_t for Starburst galaxies, whose probability
distribution is set by the condition that the sum of probabilities
over all galaxy types must be equal to 1.
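As a rough illustration of how such a prior can be evaluated, the sketch below codes P(T|i) and P(z|T,i) in Python. It assumes the Benitez (2000) functional forms, i.e. a type fraction that decreases exponentially with magnitude and a redshift distribution proportional to z^α exp[-(z/z_m)^α]. The function names, the redshift grid and the example values of α, z_0 and k_m are hypothetical; the f_t and k_t values are the LSST entries as read from table 4.

```python
import math

def type_fraction(i_mag, ft, kt):
    """P(T|i) = f_t * exp(-k_t * (i - 20)) for one spectral family."""
    return ft * math.exp(-kt * (i_mag - 20.0))

def family_fractions(i_mag, early, late):
    """Early and Late fractions from (f_t, k_t); Starburst closes the sum to 1.

    Only meaningful for magnitudes faint enough that Early + Late <= 1.
    """
    p_early = type_fraction(i_mag, *early)
    p_late = type_fraction(i_mag, *late)
    return {"Early": p_early, "Late": p_late,
            "Starburst": 1.0 - p_early - p_late}

def redshift_prior(z_grid, i_mag, alpha, z0, km, pm=0.0, beta=1.0):
    """P(z|T,i) on a grid, normalized to unit sum (assumed functional form)."""
    z_m = z0 + km * (i_mag - 16.0) + pm * (i_mag - 16.0) ** beta
    raw = [z ** alpha * math.exp(-((z / z_m) ** alpha)) for z in z_grid]
    total = sum(raw)
    return [value / total for value in raw]

# Example: 101 redshift nodes on [0, 4.5] and a galaxy with i = 24.
z_grid = [0.045 * k for k in range(101)]
fractions = family_fractions(24.0, early=(0.22, 0.27), late=(0.41, 0.10))
# alpha, z0 and km below are illustrative placeholders, not fitted values.
p_z = redshift_prior(z_grid, 24.0, alpha=2.0, z0=0.1, km=0.1)
z_mode = z_grid[p_z.index(max(p_z))]
```

With this functional form the prior peaks at z = z_m, so the mode found on the grid should sit at the node closest to z_0 + k_m (i - 16) when p_m = 0.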

When prior probabilities are taken into account, the χ^2 is extended
and is defined as

\[ \chi^2(z, T, E(B-V)) = \sum_{X}^{N_{band}} \left( \frac{F_{X,obs} - A \, F_{X,pr}(z, T, E(B-V))}{\sigma(F_{X,obs})} \right)^2 + \ln(B) - 2 \ln\left( P(T|i) \times P(z|T,i) \right), \tag{5} \]



where F_{X,obs} is the observed flux in the X-band; F_{X,pr}(z, T, E(B - V))
is the expected flux; N_band is equal to 5 for the CFHTLS and
to 6 for the LSST; σ(F_{X,obs}) is the observed flux uncertainty.
The terms A and B come from the analytical marginalization
over the normalization of the SEDs; they are defined as
follows:



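\[ A = \frac{\sum_{X}^{N_{band}} F_{X,obs} F_{X,pr} / \sigma(F_{X,obs})^2}{\sum_{X}^{N_{band}} F_{X,pr} F_{X,pr} / \sigma(F_{X,obs})^2}, \qquad B = \sum_{X}^{N_{band}} \frac{F_{X,pr} F_{X,pr}}{\sigma(F_{X,obs})^2}. \tag{6} \]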
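The evaluation of this statistic at a single node of the parameter grid can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a flat prior on the normalization N, under which integrating the Gaussian likelihood over N leaves the best-fit scale A and a ln(B) term with B the sum of the squared template signal-to-noise ratios. The function name and the toy fluxes are hypothetical.

```python
import math

def marginalized_chi2(f_obs, sigma, f_pr, log_prior=0.0):
    """Return (chi2, A) for one node of the (z, T, E(B-V)) grid.

    f_obs, sigma: observed fluxes and their uncertainties, one per band.
    f_pr: template fluxes predicted at this node, for unit normalization.
    log_prior: ln[P(T|i) * P(z|T,i)] at this node (0 means a flat prior).
    """
    s_op = sum(fo * fp / s ** 2 for fo, fp, s in zip(f_obs, f_pr, sigma))
    s_pp = sum(fp * fp / s ** 2 for fp, s in zip(f_pr, sigma))
    a = s_op / s_pp   # normalization that minimizes the flux residuals
    b = s_pp          # width term left by integrating over the normalization
    residual = sum(((fo - a * fp) / s) ** 2
                   for fo, fp, s in zip(f_obs, f_pr, sigma))
    return residual + math.log(b) - 2.0 * log_prior, a

# Toy check: observed fluxes that are exactly 2.5 times the template.
template = [1.0, 2.0, 3.0]
observed = [2.5 * f for f in template]
chi2, a_hat = marginalized_chi2(observed, [0.1, 0.1, 0.1], template)
```

In this toy case the residual term vanishes, the recovered scale is 2.5, and the statistic reduces to ln(B) alone.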



In the following, the 3D posterior pdf is defined as
ℒ = exp[-χ^2/2]. It is computed for each galaxy on a 3D grid
of 100 x 25 x 5 nodes in the (z, T, E(B - V)) parameter space.
The values of the parameters z, T, E(B - V) lie in the intervals
[0, 4.5], [0, 50], [0, 0.3] respectively. Since we restrict the
domain of possible parameter values to the ranges used
to make the simulation, we are probably reducing the number
of possible degeneracies in (z, T, E(B - V)) space. The SED library
used for the photo-z reconstruction is the CWW+Kinney
library described in Section 3.2.2. However, when considering
the CFHTLS spectro-photometric data, the templates have been
optimized following the technique developed by Ilbert et al.
(2006); they therefore naturally match the data better.

The probability distribution is a function of 3 parameters;
to derive the information on just one or two of the parameters
we integrate the distribution over all values of the unwanted
parameter(s) in a process called marginalization. The marginalized
2D probability density functions of the parameters (z, T),
(z, E(B - V)), (T, E(B - V)) are computed in this manner, as well
as the marginalized 1D probability density function of each parameter.
Figure 9 shows an example of these probability density
functions for a galaxy with true redshift z_s = 0.16, true type
T_s = 45 (Starburst) and true color excess E(B - V)_s = 0.24. In
many cases, the 3D posterior pdf is highly multimodal; minimizing
the χ^2 with traditional algorithms such as Minuit therefore
often misses the global minimum. A scan of the parameter space
is better suited to this application. Even a Markov Chain Monte
Carlo (MCMC) method, which is usually more efficient than a
simple scan, is not well suited to a multimodal 3D posterior
pdf. Moreover, the production of the chains and their analysis in
a 3D parameter space is slower than a scan. This example, where
(z_p - z_s)/(1 + z_s) = 0.27, corresponds to a catastrophic reconstruction.
In this case, the parameters (z^grid, T^grid, E(B - V)^grid)
that maximize the 3D posterior pdf on the grid do not coincide with the
ones that maximize the individual posterior probability density
functions, namely the parameters (z^marg, T^marg, E(B - V)^marg).
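The difference between the grid maximum and the maxima of the marginalized distributions can be made concrete with a small sketch. The posterior is stored here as a nested list indexed as post[iz][it][ie]; this layout, the function names and the toy 3 x 3 x 1 grid are assumptions made for illustration only.

```python
def grid_maximum(post):
    """Indices (iz, it, ie) of the node with the largest 3D posterior."""
    nodes = ((iz, it, ie)
             for iz, plane in enumerate(post)
             for it, row in enumerate(plane)
             for ie in range(len(row)))
    return max(nodes, key=lambda n: post[n[0]][n[1]][n[2]])

def marginal_1d(post, axis):
    """1D marginal along axis 0 (z), 1 (T) or 2 (E(B-V)), with unit sum."""
    sizes = (len(post), len(post[0]), len(post[0][0]))
    out = [0.0] * sizes[axis]
    for iz in range(sizes[0]):
        for it in range(sizes[1]):
            for ie in range(sizes[2]):
                out[(iz, it, ie)[axis]] += post[iz][it][ie]
    total = sum(out)
    return [value / total for value in out]

# Toy 3 x 3 x 1 posterior: a sharp mode at z index 0 and a broad one at index 2.
post = [
    [[0.5], [0.0], [0.0]],
    [[0.0], [0.0], [0.0]],
    [[0.3], [0.3], [0.3]],
]
iz_grid = grid_maximum(post)[0]
p_z = marginal_1d(post, axis=0)
iz_marg = p_z.index(max(p_z))
```

Here the highest single node lies in the sharp mode, while the broad mode carries more integrated probability, so the two redshift estimates disagree, which is the situation described for the galaxy of Fig. 9.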



5.2. Statistical test and rejection of outliers 

In this section, we outline a statistical test that aims at rejecting
some of the outlier galaxies, for which |z_p - z_s|/(1 + z_s) > 0.15.
It is based on the characteristics of the 1D posterior probability
density functions P(z_p), P(T_p) and P(E(B - V)_p). The test is
calibrated with a training sample, for which the true redshift is
known (or the spectroscopic redshift in the case of real data). In
the following, the LSST simulation for ten years of observation
is considered in order to illustrate the method.



5.2.1. The probability density function characteristics 

Fig. 12. From the LSST simulation training sample. Top panel:
evolution of the rejection vs. L_R. Bottom left-hand panel:
evolution of the acceptance vs. the rejection. Bottom right-hand
panel: evolution of the acceptance vs. L_R.

The variables considered in order to establish the statistical test
are the following:

- The number of peaks in the marginalized 1D posterior probability
density functions, denoted by N_pk(θ), where θ is either
z, T or E(B - V).

- When N_pk > 1, the logarithm of the ratio of the height
of the secondary peak to that of the primary peak in the 1D posterior
probability density functions, denoted by R_L(θ).

- When N_pk > 1, the ratio of the probability associated with
the secondary peak to the probability associated with the primary
peak in the 1D posterior probability density functions,
denoted by R_pk(θ). The probability of a peak is defined as the integral
of the density between the two minima on either side of it.

- The absolute difference between the values of z_p and z^marg,
denoted by D_pk = |z_p - z^marg|, where z^marg is the redshift that
maximizes the posterior probability density function P(z).

- The maximum value of log(ℒ).

- The colors C = (u - g, g - r, r - i, i - z) in the case of CFHTLS,
with an extra z - y term in the case of the LSST.
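The peak-based variables of this list can be extracted from a gridded 1D posterior along the following lines. The sketch is illustrative only: the peak finder is a simple local-maximum search that ignores plateaus, the logarithm is taken in base 10 (the base is not specified in the text), and all function names are hypothetical.

```python
import math

def find_peaks(pdf):
    """Indices of the strict local maxima of a 1D pdf sampled on a grid."""
    peaks = []
    for k, value in enumerate(pdf):
        left = pdf[k - 1] if k > 0 else -math.inf
        right = pdf[k + 1] if k < len(pdf) - 1 else -math.inf
        if value > left and value > right:
            peaks.append(k)
    return peaks

def peak_mass(pdf, k):
    """Sum of the pdf between the two minima on either side of peak k."""
    lo = k
    while lo > 0 and pdf[lo - 1] < pdf[lo]:
        lo -= 1
    hi = k
    while hi < len(pdf) - 1 and pdf[hi + 1] < pdf[hi]:
        hi += 1
    return sum(pdf[lo:hi + 1])

def peak_features(pdf):
    """N_pk and, when N_pk > 1, the height and probability ratios."""
    peaks = sorted(find_peaks(pdf), key=lambda k: pdf[k], reverse=True)
    features = {"n_peaks": len(peaks)}
    if len(peaks) > 1:
        first, second = peaks[0], peaks[1]
        features["log_height_ratio"] = math.log10(pdf[second] / pdf[first])
        features["mass_ratio"] = peak_mass(pdf, second) / peak_mass(pdf, first)
    return features

# Toy bimodal posterior with a primary peak at index 2 and a secondary at 5.
pdf = [0.0, 0.1, 0.4, 0.1, 0.05, 0.2, 0.1, 0.05]
features = peak_features(pdf)
```

Note that the grid node at the minimum between two peaks is counted in both integrals in this simple version; a finer treatment would split it.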

We denote by O the galaxies that are considered as outliers
and by G the galaxies for which the redshift is well reconstructed,
defined in the following way:

- O: |z_p - z_s|/(1 + z_s) > 0.15

- G: |z_p - z_s|/(1 + z_s) < 0.15

The set of variables defined in the list above is denoted
by the vector μ. From a given training sample, we compute
the distributions P(μ_i|O) and P(μ_i|G). For convenience we
adopt reduced variables that are renormalized to lie between 0
and 1. Distributions of some of the reduced variables are plotted
in Fig. 10. It is clear that the distributions P(μ_i|G) and P(μ_i|O)
are different. For example, the probability that an outlier galaxy O presents
more than three peaks in its posterior probability density function
P(z_p) is larger than for a well reconstructed galaxy G. A
combination of these different pieces of information leads to an
efficient test to discriminate between good and catastrophic
reconstructions.



5.2.2. Likelihood ratio definition 

Fig. 13. From the LSST simulation training sample. 2D distribution
of (z_p - z_s)/(1 + z_s) as a function of the threshold L_R,c. In
each bin in L_R,c, the distribution of (z_p - z_s)/(1 + z_s) is normalized.
We find L_R,c > 0.98 to be a good choice.

Fig. 14. Histogram of the number of galaxies from the CFHTLS
sample with L_R > L_R,c as a function of L_R,c. The black curve (data, Case A)
has been obtained with densities P(μ|G) and P(μ|O) computed
from the CFHTLS data themselves, whereas the red curve (simulation, Case B) relies
on densities obtained from the CFHTLS simulation.

In order to combine the information contained in the densities
P(μ_i|G) and P(μ_i|O), we define the likelihood ratio variable L_R:

\[ L_R(\mu) = \frac{P(\mu|G)}{P(\mu|G) + P(\mu|O)}, \]

where

\[ P(\mu|G) = \prod_{i=1}^{N_\mu} P(\mu_i|G), \qquad P(\mu|O) = \prod_{i=1}^{N_\mu} P(\mu_i|O), \]

and N_μ is the number of components of μ. Here, the variables
μ_i are assumed to be independent; the correlation matrix
of the μ_i indeed shows a low correlation between the parameters.
We approximate the two probabilities P(G|μ) and P(O|μ) by
the products P(μ|G) and P(μ|O), neglecting the correlations, as
our aim is just to define a variable for discriminating between the two
possibilities. The probability density functions P(L_R|G) and P(L_R|O)
are computed from the training sample, and are displayed in
Fig. 11. The results shown here are from the LSST simulation
for ten years of observations with m_X < m_{5,X}. As expected, the
two distributions are very different.
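A minimal sketch of the likelihood ratio computation is given below. It assumes that the per-variable densities P(μ_i|G) and P(μ_i|O) are estimated as histograms of the reduced variables on [0, 1]; the number of bins and the pseudo-count used to avoid empty bins are choices made here for illustration, and the function names and toy training values are hypothetical.

```python
def binned_density(values, n_bins=20, pseudo_count=0.5):
    """Histogram estimate of the density of a reduced variable in [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v * n_bins), n_bins - 1)] += 1
    total = sum(counts) + pseudo_count * n_bins
    return [(c + pseudo_count) / total for c in counts]

def likelihood_ratio(mu, dens_good, dens_outlier):
    """L_R = P(mu|G) / (P(mu|G) + P(mu|O)), treating the mu_i as independent."""
    p_g = p_o = 1.0
    for value, dg, do in zip(mu, dens_good, dens_outlier):
        k = min(int(value * len(dg)), len(dg) - 1)
        p_g *= dg[k]
        p_o *= do[k]
    return p_g / (p_g + p_o)

# Toy training sample with two reduced variables per galaxy.
good = [(0.05 + 0.002 * k, 0.05 + 0.002 * k) for k in range(50)]
outliers = [(0.80 + 0.002 * k, 0.80 + 0.002 * k) for k in range(50)]
dens_g = [binned_density([g[i] for g in good]) for i in range(2)]
dens_o = [binned_density([o[i] for o in outliers]) for i in range(2)]

lr_good_like = likelihood_ratio([0.10, 0.10], dens_g, dens_o)
lr_outlier_like = likelihood_ratio([0.85, 0.85], dens_g, dens_o)
```

A galaxy whose variables resemble the well reconstructed training galaxies gets L_R close to 1, and one resembling the outliers gets L_R close to 0.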






A. Gorecki et al. : Photometric redshift reconstruction techniques and methods to reject catastrophic outliers. 



The quality of a discrimination test, such as L_R > L_R,c, can
be quantified by the acceptance Acc and rejection Rej rates:

\[ Acc(L_{R,c}) = \int_{L_{R,c}}^{1} P(L'_R|G) \, dL'_R, \qquad Rej(L_{R,c}) = \int_{0}^{L_{R,c}} P(L'_R|O) \, dL'_R. \]

The evolution of Acc and Rej as functions of L_R,c and of Acc as a
function of Rej is displayed in Fig. 12. The larger the difference
between the curve Acc vs. Rej and the line Acc = 1 - Rej,
the higher the rejection power. Figure 12 shows that the method
should work because the solid line lies far from the diagonal dotted
line in the bottom left panel. A high value of L_R,c is necessary
to discard outliers; however, it should be chosen so that a minimum
of well-reconstructed galaxies are removed. The distribution of
(z_p - z_s)/(1 + z_s) as a function of the cut on L_R, namely L_R,c,
is displayed in Fig. 13, which is a visual illustration of the L_R
cut effectiveness. The plots in Figs. 15 and 16, discussed below,
show that there is a significant improvement when a cut on L_R is
applied.
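On a finite training sample the two integrals reduce to simple fractions, as in the sketch below. The function names and the toy L_R values are hypothetical.

```python
def acceptance_rejection(lr_good, lr_outlier, lr_cut):
    """Empirical Acc and Rej for the selection L_R > lr_cut."""
    acc = sum(1 for v in lr_good if v > lr_cut) / len(lr_good)
    rej = sum(1 for v in lr_outlier if v <= lr_cut) / len(lr_outlier)
    return acc, rej

def scan_cuts(lr_good, lr_outlier, cuts):
    """(cut, Acc, Rej) triplets, e.g. to draw the Acc vs. Rej curve."""
    return [(c, *acceptance_rejection(lr_good, lr_outlier, c)) for c in cuts]

# Toy L_R values for well reconstructed galaxies and for outliers.
lr_good = [0.999, 0.99, 0.95, 0.70]
lr_outlier = [0.985, 0.50, 0.20, 0.10]
curve = scan_cuts(lr_good, lr_outlier, [0.0, 0.6, 0.98])
```

Raising the cut trades acceptance for rejection; a test with no discriminating power would follow the line Acc = 1 - Rej.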



6. Photo-z performance with template-fitting 

In the following sections, the ability of the statistical test, based
on the L_R variable, to construct a robust sample of galaxies with
well reconstructed redshifts is investigated in more detail, both
for the CFHTLS spectro-photometric data and the LSST simulation.
The efficiency of the photo-z reconstruction is quantified
by studying the distribution of (z_p - z_s)/(1 + z_s), through the:

- bias: the median, which splits the sorted distribution into two equal
samples.

- rms: the interquartile range (IQR)^7. If the distribution is
Gaussian, it is approximately equal to 1.35σ, where σ is the
standard deviation.

- η: the percentage of outliers, for which |z_p - z_s|/(1 + z_s) > 0.15.

Table 2 gives the LSST requirements for these values.
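The three quantities can be computed as in the following sketch, which uses only the Python standard library. The function name and the five toy redshifts are hypothetical, and the quartiles are obtained with the "inclusive" interpolation rule of `statistics.quantiles`, which is one of several possible conventions.

```python
import statistics

def photoz_metrics(z_phot, z_spec, outlier_cut=0.15):
    """Bias (median), rms (interquartile range) and outlier percentage."""
    delta = [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]
    q1, _, q3 = statistics.quantiles(delta, n=4, method="inclusive")
    n_out = sum(1 for d in delta if abs(d) > outlier_cut)
    return {"bias": statistics.median(delta),
            "rms": q3 - q1,
            "eta": 100.0 * n_out / len(delta)}

# Toy sample: four good reconstructions and one catastrophic outlier.
z_spec = [0.5, 1.0, 1.5, 2.0, 1.0]
z_phot = [0.5, 1.1, 1.4, 2.9, 1.0]
metrics = photoz_metrics(z_phot, z_spec)
```

Because the median and the interquartile range depend only on the central part of the distribution, the single outlier in this toy sample changes η but barely affects the bias and the rms.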



6.1. Results for CFHTLS 

In this section, the reconstruction of the photometric redshifts,
and the consequences of the selection on L_R for the spectro-photometric
data of CFHTLS, are presented in order to validate
the method. Two cases are considered:

- Case A: The distributions P(μ|G) and P(μ|O) are computed
from the data themselves.

- Case B: The distributions P(μ|G) and P(μ|O) are computed
from a simulation of the CFHTLS data, as explained in section 4.

In both cases, the photo-z is computed from the real CFHTLS
data. In order to compare the effect of a selection on L_R, in particular
to keep the same total number of galaxies for a particular
selection, different values of L_R,c have to be chosen for Case A
and Case B. For a given number of galaxies left in the sample,
the values of L_R,c for both cases can be obtained from Fig. 14.

Figure 15 shows the evolution of the number of galaxies, the
bias, rms, and the parameter η as functions of z_s, for three values
of L_R,c for Case A (solid lines) and B (dashed lines). The photometric
redshifts are clearly improved if a large enough value of
L_R,c is applied (i.e. L_R > 0.9), even in Case B. However, among
the galaxies that are removed, there are many well reconstructed
galaxies.

In the top panel of Fig. 15, larger L_R,c values remove more
galaxies from the lowest redshift bins, indicating that probably
L_R > 0.6 is required to remove outlier galaxies that have been
assigned a low redshift due to e.g. confusion between the Lyman
and Balmer breaks. In the second panel, the L_R selection does
not have a very strong effect on the bias, probably because our
definition of bias is not heavily affected by outliers. The rms,
on the other hand, is improved by up to a factor of 3 at higher
redshifts, and the catastrophic outlier rate is similarly improved
by up to a factor of 3 at higher redshifts. However, the
efficiency of the L_R selection on the CFHTLS data is significantly
below that on the LSST simulated data (see the following
subsection). The CFHTLS spectro-photometric sample used
here predominantly contains bright, well measured galaxies,
because these are the kind generally selected for spectroscopic
follow-up. Therefore we expect there to be far fewer galaxies in
this sample that have outlier photometric redshifts, and as such
the L_R calibration is less robust, leading to the removal of too
many well reconstructed galaxies.

In Fig. 6 we see that the spectroscopic sample redshift coverage
barely extends beyond z = 1.4. This means that when the L_R
selection was calibrated on the CFHTLS data, the sample was
missing outlier galaxies with high redshifts that could be estimated
erroneously to be at low redshifts, e.g. those subject to
the degeneracy causing the Lyman break to be confused with the
4000 Å break. The LSST simulated data will contain these degeneracies;
however, it would be good to test the L_R selection
method on real data out to higher redshifts in the future.

The agreement between the solid (Case A) and dashed (Case
B) lines in Fig. 15 shows that, if no spectroscopic survey is
available, a simulation that reproduces the colors of the photometric
catalog can be used to calibrate the likelihood ratio
test. The model used to compute the photo-z, that is to say the
CWW+Kinney library and the Calzetti and Cardelli extinction laws,
correctly accounts for the data.

6.2. Results for LSST

We use a total of 50 million galaxies in our simulated catalog.
This catalog is divided into 5 different sets. Each set is separated
into a test sample (2 million galaxies) and an analysis sample
(8 million galaxies). In each set, the statistical test is performed
on "observed" galaxies within the test sample, then the densities
P(μ|G) and P(μ|O) are used to compute the value of L_R for
"observed" galaxies in the analysis sample. Performing the reconstruction
on the 5 independent sets gives us a measure of the
fluctuation from one set to another and thus an estimate of the error
on our reconstruction parameters. We performed the same
analysis with 10 sets of half as many galaxies and measured the
fluctuations to be very similar.

6.2.1. Observation in six bands

To test the method with the best photometric quality we require each
galaxy to be "observed" in each band with good precision, m_X <
m_{5,X}. This requirement leaves us with about 125 000 galaxies in
the test sample and 500 000 in the analysis sample.

7 The interquartile range is the interval spanning the second and third
quartiles.

Fig. 15. CFHTLS spectro-photometric data. Top: evolution of the fraction of galaxies and the bias with z_s. Bottom: evolution of the
rms and η as a function of z_s. Case A (calibrated with data) is shown by the solid lines and Case B (calibrated with simulation)
by the dashed lines. The different values of L_R,c for Case A are reported in the legend, and for Case B in parentheses in the legend.

Fig. 16. Distribution of z_p - z_s versus z_s for a simulated LSST catalog, for all galaxies (left) and for galaxies with likelihood ratio
L_R greater than 0.98 (right).



Figure 16 shows the 2D distributions from the LSST simulation
of z_p - z_s as a function of z_s for all the galaxies in the sample
compared to the same distribution after performing an L_R selection.
It is clear that selecting on the likelihood ratio enhances the
photometric redshift purity of the sample.

Figure 17 (the same as Fig. 15 for CFHTLS) shows the evolution
with z_s of the number of galaxies retained in the LSST
sample and of each of the parameters listed above (bias, rms, η).
This indicates the quality of the photo-z for different values of
L_R,c.

The LSST specifications on the bias and rms (see table 2)
are fulfilled up to z_s = 1.5 with only a low value of L_R,c > 0.6.
For redshifts greater than 1.5, a higher value of L_R,c is required
in order to reach the expected accuracy. There are two main reasons
for this. Firstly, only a small percentage of the galaxies with
z_s > 1.5 are used to calibrate the densities P(μ|G) and P(μ|O),
therefore the high-redshift galaxies do not have much weight in
the calibration test. Secondly, the ratio between the height of the
distribution P(L_R|G) at L_R = 0 and at L_R = 1 tends to increase
with the redshift, meaning that the purity of the test is degraded.
Finally, only a very low value of L_R,c > 0.2 is needed to enable
η to meet LSST specifications for z_s > 2.2.

Fig. 17. LSST L_R selection. Top: evolution of the fraction of galaxies and the bias with z_s. Bottom: evolution of the rms and η as a
function of z_s. The thick black lines represent the LSST requirements given in table 2. Ten years of observations with the LSST are
assumed. The values of L_R,c are reported in the legend.

The effect of a selection on the likelihood ratio L_R can be
compared to the effect of a selection on the apparent magnitude
in the i-band, as shown in Fig. 18. The increase in the
i-magnitude selection efficiency at large z shown in Fig. 18 (top)
is due to the value of the i cut approaching the detection threshold.
Performing a selection on a quantity other than magnitude, such
as the likelihood ratio, ensures that "well measured" but faint
galaxies are still included in the sample.

Of course, requiring observation in six bands will exclude
the u-band dropout galaxies at high redshift, so the photo-z
performance at these redshifts will be greatly affected by this
requirement. The next section investigates the photo-z performance
when observations are required in fewer than six bands.

6.2.2. Observation in five bands (and less) 

The previous subsection demonstrated our results for good photometric
data with m_X < m_{5,X} in all six bands of the LSST. In
order to detect more galaxies and extend our reconstruction to
higher redshift, we relax the constraint on the number N_m5 of
"well observed" bands having m_X < m_{5,X}. Both test and analysis
samples are made of galaxies with m_X < m_{5,X} in at least N_m5
bands. We decreased N_m5 from 6 to 5, 4 and 3 and performed
an analysis similar to that presented in the previous subsection.
The comparison of the results indicates that the selection
with N_m5 = 5 gives the best results. As we can see in Fig. 19,
lowering N_m5 from 6 to 5 greatly increases the number of galaxies
we keep in our sample without significantly degrading the
reconstruction performance. The gain in the number of galaxies
is presented in table 6 and increases with redshift as expected.

For galaxies "observed" in only 5 bands, the band X which
has m_X > m_{5,X} or is not observed at all (noise level) is the u band
in 95% of the cases.

Table 5. LSST number of galaxies, comparison between the
selection on L_R and the selection on i band magnitude.
Observations were required in at least 5 bands.

                 Total     0 < z < 1   1 < z < 2   2 < z < 3
L_R,c =          893123    537016      326571      29536
L_R,c = 0.98     426667    311997      110960      3710
i < 24           314461    263207      50084       1171

When requiring fewer than 5 bands, the results are worse or
similar: in order to reconstruct decent photometric redshifts in
this case we need to apply such a large value of L_R,c for the selection
that we reject nearly all the galaxies gained from the weaker
requirement, and even discard well-measured galaxies.

Table 6. LSST number of galaxies (L_R,c = 0.98)

             Total     0 < z < 1   1 < z < 2   2 < z < 3
N_m5 = 6     296715    235059      60818       838
N_m5 = 5     426667    311997      110960      3710
N_m5 = 4     533363    375966      149748      7649
N_m5 = 3     534307    376772      149715      7819

Figure 20 shows the comparison of the number of galaxies and the
photo-z performance (rms, bias, η) for an L_R selected sample
(L_R > 0.98) and a magnitude selected sample (i < 24). While
both samples satisfy the LSST science requirements given in table 2,
the L_R selection is more efficient, since it retains a significantly
larger number of galaxies, specifically for z > 1 (see table
5). We do not present a comparison with a sample selected by a
magnitude cut of i < 25.3 since it would not satisfy the LSST
photo-z requirements according to our simulation and photo-z
reconstruction (as can be seen in Fig. 18).
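The size of this gain can be read directly from the counts of table 5, as in the short worked example below. The dictionary layout and variable names are hypothetical; the numbers are the table entries for the L_R,c = 0.98 and i < 24 selections.

```python
# Galaxy counts copied from table 5 (at least 5 well observed bands).
lr_selected = {"total": 426667, "0<z<1": 311997, "1<z<2": 110960, "2<z<3": 3710}
i_selected = {"total": 314461, "0<z<1": 263207, "1<z<2": 50084, "2<z<3": 1171}

# Relative gain of the L_R > 0.98 selection over the i < 24 selection.
gain = {key: lr_selected[key] / i_selected[key] - 1.0 for key in lr_selected}
for key, value in gain.items():
    print(f"{key:>6}: {100.0 * value:+.1f}%")
```

The overall gain comes out at about 36%, in line with the "roughly 35% more galaxies" quoted in the abstract, and it grows steeply with redshift: the L_R selection keeps more than twice as many galaxies at 1 < z < 2 and more than three times as many at 2 < z < 3.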



7. Photo-z performance with neural network 



It has been shown using the public code ANNz by Collister &
Lahav (2004) that the photometric redshifts can be correctly estimated
via a neural network. This technique, like other empirical
methods, requires a spectroscopic sample for which the
apparent magnitudes and the spectroscopic redshifts are known.
The Toolkit for Multivariate Analysis (Hoecker et al. 2007,
TMVA) provides a ROOT-integrated environment for the processing,
parallel evaluation and application of multivariate classification
and multivariate regression techniques. All techniques
in TMVA belong to the family of supervised learning algorithms.
They make use of training events, for which the desired output is
known, to determine the mapping function that either describes
a decision boundary or an approximation of the underlying functional
behavior defining the target value. The mapping function
can contain various degrees of approximations and may be a single
global function, or a set of local models. Besides artificial
neural networks, many other algorithms, such as boosted decision
trees or support vector machines, are available. An advantage
of TMVA lies in the fact that different algorithms can be tested
at the same time in a very user-friendly way.

7.1. Method 

The MultiLayer Perceptron (MLP) neural network principle is
quite simple. It builds up a function that maps the observables
to the target variables, in our case, the redshifts. The coefficients
of the function, namely the weights, are such that they
minimize the error function, which is the sum over all galaxies in
the sample of the squared difference between the output of the network
and the true value of the target. Two samples of galaxies are
necessary, the training and the test sample. The latter is used to
test the convergence of the network and to evaluate its performance.
It usually prevents over-training of the network, which
may arise when the network learns the particular features of the
training sample.

Fig. 19. LSST reconstruction for different N_m5 requirements. Top: evolution of the fraction of galaxies and the bias with z_s. Bottom:
evolution of the rms and η as a function of z_s. The thick black lines represent the LSST requirements given in table 2. Ten years of
observations with the LSST are assumed. The values of N_m5 and L_R,c are reported in the legend.

A neural network is built with layers and nodes. There are at
least two layers, one for the input observables x and one for the
target z_MLP. Each node of a layer is related to all nodes from the
previous layer, with a weight w, which is the coefficient that each
connection contributes to the argument of the activation function A. The value
y_k^{i+1} of the node k from the layer i + 1 is related to the values
y_j^i of all nodes j from the layer i:

\[ y_k^{i+1} = A\left( \sum_{j=1}^{n} w_{jk}^{i} \, y_j^{i} \right), \]

where n is the number of neurons in the layer i. For the purpose
of this paper, the activation function A is a sigmoid. As an example,
in the case where there is only one intermediate layer, the
photometric redshift of the g-th galaxy is estimated as follows:



\[ z_{MLP,g} = \sum_{j=1}^{n_{hidden}} w_{j1}^{(2)} \, A\left( \sum_{i=1}^{n_{in}} w_{ij}^{(1)} \, x_{i,g} \right). \]



The error function is simply defined by

\[ E(w) = \frac{1}{2} \sum_{g=1}^{n_{train}} E_g^2(x_g, w), \]

with

\[ E_g(x_g, w) = z_{MLP,g}(x_g, w) - z_{s,g}, \]

where n_train is the number of galaxies in the training sample and
g denotes the g-th galaxy. At the first iteration, the weights have
random values. The gradient descent method, which consists of
modifying the weight values according to the derivative of E with
respect to the weights, is used to minimize E. For example, after
one iteration, we have

\[ w_{j1}^{(2)} \rightarrow w_{j1}^{(2)} + \Delta w_{j1}^{(2)}, \qquad \Delta w_{j1}^{(2)} = -\alpha \frac{\partial E}{\partial w_{j1}^{(2)}} = -\alpha \sum_{g=1}^{n_{train}} E_g \frac{\partial E_g}{\partial w_{j1}^{(2)}}. \]



Fig. 20. LSST comparison between L_R,c = 0.98 and i < 24 cuts. Top: evolution of the fraction of galaxies and the bias with z_s.
Bottom: evolution of the rms and η as a function of z_s. The thick black lines represent the LSST requirements given in table 2. Ten
years of observations with the LSST are assumed.

Fig. 21. Comparison between the template-fitting method and the neural network for CFHTLS data. The bias, the rms of the distribution
of Δz/(1 + z_s), and the parameter η are displayed as functions of the true redshift for the CFHTLS data. Data points are
reported only if the number of galaxies in the sample is greater than ten.

The parameter α is the learning rate, and has to be determined for
each specific case. It must not be too large, otherwise the steps
are so large that the minimum of E is never reached. It must not
be too small either, otherwise too many iterations are required.
The testing sample is used as a convergence and performance
test. Indeed, the errors decrease with the number of iterations
in the training sample, but reach a constant value on the testing
sample. Weights are finally kept when the errors on the testing
sample reach a constant value.
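The training procedure described above can be sketched in a few lines of plain Python. This is not the TMVA implementation: it is a toy perceptron with a single sigmoid hidden layer and a linear output node, in which a constant input equal to 1 plays the role of a bias term, the gradient is averaged over the training sample, and training stops once the test-sample error has not improved for a fixed number of iterations. All names, the network size, the learning rate and the synthetic data are assumptions made for illustration.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def predict(x, w1, w2):
    """Network output and hidden-node values for one galaxy."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    return sum(w * h for w, h in zip(w2, hidden)), hidden

def half_squared_error(sample, w1, w2):
    """E(w): half the sum of squared differences between output and target."""
    return 0.5 * sum((predict(x, w1, w2)[0] - z) ** 2 for x, z in sample)

def train(train_set, test_set, n_hidden=4, rate=0.05,
          max_iter=2000, patience=50, seed=1):
    """Gradient descent; returns (test error, w1, w2) at the best iteration."""
    rng = random.Random(seed)
    n_in = len(train_set[0][0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    best = (half_squared_error(test_set, w1, w2), w1, w2)
    stale = 0
    for _ in range(max_iter):
        g1 = [[0.0] * n_in for _ in range(n_hidden)]
        g2 = [0.0] * n_hidden
        for x, z in train_set:
            out, hidden = predict(x, w1, w2)
            err = out - z
            for j in range(n_hidden):
                g2[j] += err * hidden[j]
                back = err * w2[j] * hidden[j] * (1.0 - hidden[j])
                for i in range(n_in):
                    g1[j][i] += back * x[i]
        n = len(train_set)
        w2 = [w - rate * g / n for w, g in zip(w2, g2)]
        w1 = [[w - rate * g / n for w, g in zip(row, grow)]
              for row, grow in zip(w1, g1)]
        test_err = half_squared_error(test_set, w1, w2)
        if test_err < best[0]:
            best, stale = (test_err, w1, w2), 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Synthetic data: two "magnitudes" in [0, 1] plus a constant input of 1.
rng = random.Random(0)

def make_sample(n):
    sample = []
    for _ in range(n):
        m1, m2 = rng.random(), rng.random()
        sample.append(([m1, m2, 1.0], 0.8 * m1 + 0.4 * m2))
    return sample

train_set, test_set = make_sample(40), make_sample(20)
err_start = train(train_set, test_set, max_iter=0)[0]
err_end = train(train_set, test_set)[0]
```

Keeping the weights of the iteration with the lowest test-sample error is one simple way to implement the stopping rule described in the text.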

For this non-exhaustive study, we have chosen the observables
x = (m, σ(m)), and two layers of ten nodes each. In
Fig. 21, the bias, rms and outlier rate η are compared for
the template-fitting method and for the neural network. It is
clear that the outlier rate is much smaller at all redshifts when
the photo-z is estimated from the neural network. The dispersion
of the photometric redshifts is also smaller for the neural
network than for the template-fitting method. These
characteristics are expected because the training sample is very
similar to the test sample, whereas the template-fitting method
uses only a small amount of prior information. Moreover, in the
template-fitting method, the apparent magnitudes are fitted with
a model based on a set of SED templates, whereas no theoretical
model has to be assumed to run the neural network. However, these
attributes may be reversed if the two samples are different, as
will be illustrated in Section 7.3. The over-estimation of photo-z
at low redshifts, and under-estimation at high redshifts, shown
by the downward slope in the bias, can be attributed to attenuation
bias. This is the effect of the measurement errors in the
observed fluxes, which cause the measured slope of the linear
regression to be under-estimated on average; see Freeman et al.
(2009) for a full discussion of this bias. We note that the photo-z
bias obtained from the template-fitting method has the opposite
sign to, and the same amplitude as, that obtained from the neural
network method. Since we have reason to expect that the neural
network has a downward slope in the bias, this indicates that the
two estimators can be used complementarily. This is investigated
in the next subsection.
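
Attenuation bias can be illustrated with a minimal sketch. It assumes a toy linear relation with unit slope and Gaussian noise added to the predictor only; the sample size, the noise levels and the helper `ols_slope` are illustrative assumptions and are not taken from the analysis above.

```python
import random


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    return cov_xy / var_x


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_gal = 20000
sigma_true, sigma_noise = 1.0, 1.0  # illustrative values, not taken from the text

x_true = [rng.gauss(0.0, sigma_true) for _ in range(n_gal)]
y = list(x_true)  # the target depends on the true predictor with slope 1
x_obs = [x + rng.gauss(0.0, sigma_noise) for x in x_true]  # noisy measurement

slope_clean = ols_slope(x_true, y)  # regression on the true predictor
slope_noisy = ols_slope(x_obs, y)   # regression on the noisy predictor
expected = sigma_true**2 / (sigma_true**2 + sigma_noise**2)  # attenuation factor
```

With equal signal and noise variances the fitted slope is pulled from 1 towards about 0.5, so predictions are too high at the low end of the range and too low at the high end, which is the sign of the slope in the bias described above.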



7.2. Results for CFHTLS 

Here, a possible combination of photo-z estimators (from the 
template-fitting method on the one hand and from the neural 
network on the other hand) is outlined. Even if the fraction of 
outlier galaxies is smaller with a neural network method than for 
the template-fitting method, using only the neural network to es- 
timate the photo-z appears not to be sufficient to reach the strin- 
gent photo-z requirements for the LSST, especially when the 
spectroscopic sample is limited. In fact, neural networks seem to
produce a photo-z reconstruction that is slightly biased at both
ends of the range (see Fig. 21). This is due to the under-sampling
of galaxies at low and very high redshifts in the training sample.
When spectroscopic redshifts are available, it is therefore worth
combining both estimators.

With the CFHTLS data, there is a correlation between z_p - z_MLP
and z_p - z_s, where z_p is the photo-z estimated with the
template-fitting method, as displayed in Fig. 22. This correlation
could be used to remove some of the outlier galaxies for which
the difference between z_p and z_s is large, for example, by
removing galaxies with |z_p - z_MLP| ≳ 0.3. This correlation
appears because, when the neural network is well
trained, as is the case here, the photo-z is well estimated and
z_MLP becomes a good proxy for z_s.

One can see an example of the impact of using both estimators
z_p and z_MLP in Fig. 23. The distribution of z_p - z_s is plotted
for three cases: with L_R > 0.9 only, with |z_p - z_MLP| < 0.3 only,
and with both cuts. The number of outlier galaxies decreases from
the first to the third case. This shows that if the training sample
is representative of the photometric catalog, neural networks have
the capability to tag galaxies with an outlier template-fitted photo-z.
However, this will be difficult to achieve in practice because the
training sample is biased in favor of bright, low-redshift galaxies,
which make up the majority of those selected for spectroscopic
observations. Selecting with both variables improves the photo-z
estimation compared to a cut only on |z_p - z_MLP|, but to a lesser extent.
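
The three selections compared above can be sketched as follows. The thresholds 0.9 and 0.3 are the ones quoted in the text, while the function name `select_galaxies` and the five-galaxy toy catalog are hypothetical and serve only to show the logic.

```python
def select_galaxies(z_p, z_mlp, l_r, lr_cut=0.9, dz_cut=0.3):
    """Return the indices passing the L_R cut, the |z_p - z_MLP| cut, and both."""
    n = len(z_p)
    pass_lr = [i for i in range(n) if l_r[i] > lr_cut]
    pass_dz = [i for i in range(n) if abs(z_p[i] - z_mlp[i]) < dz_cut]
    dz_set = set(pass_dz)
    pass_both = [i for i in pass_lr if i in dz_set]
    return pass_lr, pass_dz, pass_both


# Hypothetical toy catalog: template-fitting photo-z, neural network photo-z,
# and likelihood ratio for five galaxies.
z_p = [0.50, 1.20, 2.80, 0.30, 1.90]
z_mlp = [0.55, 1.10, 0.40, 0.35, 1.20]
l_r = [0.99, 0.95, 0.97, 0.50, 0.92]

pass_lr, pass_dz, pass_both = select_galaxies(z_p, z_mlp, l_r)
```

In this toy catalog, galaxies 2 and 4 pass the likelihood ratio cut but have strongly discrepant estimators, so only the combined selection rejects them together with the low-L_R galaxy 3.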

7.3. Results for the LSST 

For the LSST simulation the network was composed of 2 layers 
of 12 nodes each, the training sample was composed of 10 000 
galaxies and the testing sample of 20 000 galaxies. We found that 
increasing the size of the training sample above 10 000 showed 
no improvement in the precision of the training. We attribute 
this to the regularity of the simulation: the galaxies were drawn 
from a finite number of template SEDs, therefore as soon as the 
sample represents all the galaxy types in the simulation, adding 
more galaxies does not help in populating the parameter space 
any longer. 

A scatter plot of photo-z versus spectroscopic redshift is
shown in the top panel of Fig. 24. The black points show the
results from the template-fitting method, where a selection of
L_R > 0.98 was applied, and the red points show the results
from the neural network described above. The plot compares
the photo-z performance of the neural network method and the
template-fitting method on the simulated LSST data. Similar to
Singal et al. (2011), we find that the neural network results in
fewer outliers, although it has a larger rms for "well measured"
galaxies than the template-fitting method.
In the bottom panel of Fig. 24 the correlation between
z_MLP and z_p is shown. Here the correlation between
both estimators is less useful for identifying outliers than it was
for CFHTLS. This is presumably due to both the simulation and
the fit being performed with the same set of galaxy template
SEDs. This should significantly reduce the fraction of outliers
compared to a case where the templates used to estimate z_p do
not correctly represent the real galaxies. For example, removing
some of the templates from the z_p fit reduces the photo-z quality,
as demonstrated in Benítez (2000). Therefore the existence of a
strong correlation between z_p - z_MLP and z_p - z_s may be useful in
diagnosing and mitigating problems with the SED template set.

Fig. 22. 2D histogram of z_p - z_MLP vs. z_p - z_s
for the CFHTLS data.

Fig. 23. Normalized histograms of z_p - z_s with L_R > 0.9 (black
curve), |z_MLP - z_p| < 0.3 (red curve) and both cuts (blue curve)
for the CFHTLS data.

It is difficult to obtain a spectroscopic sample of galaxies
that is truly representative of the photometric sample, in terms
of redshifts and galaxy types (Cunha et al. 2012). For example,
in the case of the LSST, the survey will be so deep that spectroscopic
redshifts will be very hard to measure for the majority
of faint galaxies or those within the "redshift desert". Here, we
briefly investigate the effect of having the spectroscopic redshift
distribution of the training sample biased with respect to the full
photometric sample.

Fig. 24. Top panel: z_p vs. z_s with the template-fitting method in
black (selection with L_R > 0.98), and with the neural network in
red, for an LSST simulation of ten years of observations. Bottom
panel: 2D histogram of z_MLP - z_p as a function of z_p - z_s.

The fact that the distribution of redshifts in the spectroscopic 
sample is different from the underlying distribution is often (con- 
fusingly) termed "redshift bias". The consequence of this bias 
can be seen by modifying the efficiency of detection as a func- 
tion of the redshift. The efficiency function is chosen to be 



\epsilon(z) = 1 - \frac{1}{1 + e^{-(\ldots)}} \qquad (7)



and it is plotted in Fig. 25 (inset). This efficiency function is then 
used to bias the training sample and the test sample, in order 
to compute new network weight coefficients. The photometric 
redshifts for another unbiased sample are then computed using 
these weights. 
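
The biasing step can be sketched as follows, as an illustration only. The argument of the exponential in the efficiency function is assumed here to be (z - z0)/width, with `z0` and `width` as hypothetical placeholder parameters, and a uniform redshift distribution stands in for the simulated catalog.

```python
import math
import random


def efficiency(z, z0=1.5, width=0.2):
    """Selection efficiency 1 - 1/(1 + exp(-u)), with assumed u = (z - z0) / width."""
    u = (z - z0) / width  # hypothetical argument; z0 and width are placeholders
    return 1.0 - 1.0 / (1.0 + math.exp(-u))


def bias_sample(redshifts, rng):
    """Keep each galaxy with a probability equal to the efficiency at its redshift."""
    return [z for z in redshifts if rng.random() < efficiency(z)]


rng = random.Random(1)  # fixed seed so the sketch is reproducible
full_sample = [rng.uniform(0.0, 3.0) for _ in range(10000)]  # stand-in catalog
biased_sample = bias_sample(full_sample, rng)

mean_full = sum(full_sample) / len(full_sample)
mean_biased = sum(biased_sample) / len(biased_sample)
```

With these placeholder values the efficiency is close to one at low redshift, equals one half at z0 and falls towards zero beyond it, so the biased sample contains very few high-redshift galaxies.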

The scatter plot of z_MLP - z_s as a function of z_s is shown
in Fig. 25. We find that the photometric redshifts are quite well
estimated as long as ε > 0.2. This figure shows qualitatively that
a bias in the training sample has a major impact on the photo-z
reconstruction performance by the neural network, at least with
the training method used here.



Fig. 25. z_MLP - z_s as a function of z_s for ten years of observations
of the LSST. The curve in the inset shows the efficiency function
ε as a function of the redshift, as it is used on the training sample
to force a bias in the redshift selection.



8. Discussion and future work 

In regard to simulations undertaken here, there are a number of 
simplifications that will be reconsidered in future work. We dis- 
cuss briefly some of these here. 

- Point source photometric errors: We have assumed photo- 
metric errors based on estimates valid for point sources, and 
since galaxies are extended sources we expect the errors to 
be larger in practice. We made an independent estimate of 
the photometric errors expected for the LSST, including the 
error degradation due to extended sources. We found that for
the median expected seeing, the photometric error scales as
σ_F/F = θ/0.7, where σ_F is the error on the flux F, and θ is
the size of the galaxy in arcseconds. The next round of simulations
will therefore include a prescription for simulating
galaxy sizes in order to improve our simulation of photomet- 
ric errors. We will also compare our simple prescription to 
results obtained from the LSST image simulator (ImSim). 

- Galactic extinction (Milky Way): Our current simulations ef- 
fectively assume that i) Galactic extinction has been exactly 
corrected for, and ii) our samples of galaxies are all drawn 
from a direction of extremely low and uniform Galactic ex- 
tinction. In practice, there will be a contribution to the photo- 
metric errors due to the imperfect correction of the Galactic 
extinction, and this error will vary in a correlated way across 
the sky. More problematically, the extinction has the effect of 
decreasing the depth of the survey as a function of position 
on the sky. To account for these effects we will construct a
mapping between the coordinate system of our simulation and
Galactic coordinates, in order to apply the Galactic extinction
in the direction of every galaxy in our simulation.
We can use the errors in those Galactic extinction values to 
propagate an error to the simulated photometry. 

- Star contamination: M-stars have extremely similar colors 
to early type galaxies and as such can easily slip into photo- 
metric galaxy samples. Taking an estimate for the expected 
LSST star-galaxy separation quality, we plan to contaminate 
our catalog with stars. This could have an important effect by 
biasing the clustering signal of galaxies since the contamination
will be worse closer to the Galactic plane.

- Enhanced SEDs: Our current simulations are probably more
prone to problems with degeneracies in color space because 
we use a uniform interpolation between the main type SEDs. 
This may lead to poorer photometric redshifts than would be 
expected in reality, since galaxies might not exhibit such a 
continuous variation in SED type. In the future we plan to 
implement a more realistic interpolation scheme as well as 
the use of more complete template libraries, such as synthetic 
spectral libraries. 
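
For the Galactic extinction item in the list above, the planned propagation of extinction errors to the photometry could look like the following minimal sketch. It assumes that the extinction error is independent of the photometric error, so that the two add in quadrature; the function name `apply_extinction` and the numerical values are hypothetical.

```python
import math


def apply_extinction(mag, mag_err, a_band, a_band_err):
    """Dim a magnitude by the extinction a_band and add the errors in quadrature."""
    observed_mag = mag + a_band  # extinction makes the source fainter
    observed_err = math.sqrt(mag_err**2 + a_band_err**2)  # assumes independent errors
    return observed_mag, observed_err


# Illustrative values only: a 24.0 mag galaxy behind 0.10 +/- 0.04 mag of extinction.
mag_obs, err_obs = apply_extinction(24.0, 0.03, 0.10, 0.04)
```

A quadrature sum treats each galaxy separately; it does not capture the correlation of the extinction error across the sky mentioned above, which would require a map-level treatment.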

Further improvements to the photo-z determination can be 
made by the use of angular cross-correlation between objects 
in the photo-z sample, and objects with spectroscopic redshifts, 
located in the same area of the sky (see Matthews & Newman
(2010) and Matthews & Newman (2011)). This cross-correlation
will help to characterize the redshift distribution of the photo- 
metric sample, even though the spectroscopic sample may be 
incomplete or otherwise not closely resemble the photometric 
sample. 



9. Conclusion 

We have developed a set of software tools to generate mock 
galaxy catalogs (as observed by the LSST or other photomet- 
ric surveys) and to compute photometric redshifts and study the 
corresponding redshift reconstruction performance. 

The validity of these mock galaxy catalogs was carefully
investigated (see Section 4). We have shown that our simulation
reproduces the photometric properties of the GOODS and
CFHTLS observations well, in particular the number count, 
magnitude and color distributions. We developed an enhanced 
template-fitting method for estimating the photometric redshifts 
which involves a likelihood ratio statistical test using the pos- 
terior probability functions of the fitted photo-z parameters (z, 
galaxy type, extinction . . . ) and the galaxy colors. 

This method was applied both to the CFHTLS data and to 
the LSST simulation to derive photo-z performance, which was 
compared to the photo-z reconstruction using a multilayer perceptron
(MLP) neural network. We have shown how results from
our template-fitting method and from the neural network might
be combined to provide a galaxy sample enriched in objects with
reliable photo-z measurements.

We find that our enhanced template-fitting method produces photometric
redshifts that are both realistic and meet LSST science requirements
when the galaxy sample is selected using the likelihood
ratio statistical test. We have shown that a selection based
on the likelihood ratio test performs better than a simple selec- 
tion based on apparent magnitude, as it retains a significantly 
larger number of galaxies, especially at large redshifts (z > 1), 
for a comparable photo-z quality. 

We confirm that the LSST requirements for photo-z determination,
i.e. (2 - 5)% dispersion on the photo-z estimate with fewer
than ~10% outliers, can be met for redshifts z < 2.5. A number
of enhancements for the mock galaxy catalog generation and
photo-z reconstruction have been identified and were discussed
in Section 8.

The photo-z computation presented here is designed for a 
full BAO simulation aiming at forecasting the precision on the 
reconstruction of the dark energy equation-of-state parameter. 
This will be presented in a companion paper (Abate et al., in 
prep.). 



GOODS mock galaxy catalog and Eric Gawiser for his help with the IGM cal-